Project Feature Selection, Model Selection and Tuning

Andres Delgadillo

1 Project: Travel Package Purchase Prediction

1.1 Objective

1.2 Data Dictionary

1.3 Questions to be answered

2 Import packages and turn off warnings

3 Import dataset and quality of data

This first assessment of the dataset shows:

4 Exploratory Data Analysis

4.1 Pandas profiling report

We can get a first statistical and descriptive analysis using pandas_profiling
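A minimal sketch of generating the report (the package was renamed ydata-profiling in recent releases; the toy frame below is a hypothetical stand-in for the real dataset):

```python
import pandas as pd

# Hypothetical stand-in for the bank-churn dataset.
df = pd.DataFrame({
    "Customer_Age": [45, 49, 51],
    "Credit_Limit": [12691.0, 8256.0, 3418.0],
})

try:
    # pandas_profiling was renamed to ydata-profiling in later releases.
    from ydata_profiling import ProfileReport
    report = ProfileReport(df, title="Dataset profile", minimal=True)
    # report.to_file("profile_report.html") would render the full HTML report.
except ImportError:
    # Fallback: pandas' built-in summary covers the core statistics.
    print(df.describe())
```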

The Pandas Profiling report shows some warnings/characteristics in the data:

4.2 Univariate Analysis

4.3 Pairplot

We are going to perform bivariate analysis to understand the relationship between the columns

4.4 Bivariate and Multivariate Analysis

There are several features with strong positive correlation:

There are some features with negative correlation:
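These correlations can be computed with pandas; a toy illustration (hypothetical values, column names taken from the data dictionary):

```python
import pandas as pd

# Hypothetical values standing in for three of the continuous columns.
df = pd.DataFrame({
    "Total_Trans_Amt": [1000, 2000, 3000, 4000],
    "Total_Trans_Ct": [20, 40, 60, 80],
    "Avg_Utilization_Ratio": [0.8, 0.6, 0.4, 0.2],
})

# Pairwise Pearson correlation over the numeric columns.
corr = df.corr()
print(corr)
```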

4.4.1 Months_on_book and Customer_Age

4.4.2 Gender and Attrition_Flag

4.4.3 Education_Level and Attrition_Flag

4.4.4 Marital_Status and Attrition_Flag

4.4.5 Income_Category and Attrition_Flag

4.4.6 Card_Category and Attrition_Flag

4.4.7 Attrition_Flag and Continuous features

Observations

5 Data Pre-Processing

5.1 Feature Engineering

5.2 Outliers detection

We are going to check whether there are outliers in the continuous features

Customer_Age

Dependent_count

Months_on_book

Total_Relationship_Count

Months_Inactive_12_mon

Contacts_Count_12_mon

Credit_Limit

Total_Revolving_Bal

Total_Amt_Chng_Q4_Q1

Total_Trans_Amt

Total_Trans_Ct

Total_Ct_Chng_Q4_Q1

Avg_Utilization_Ratio
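One common way to flag outliers in these columns is the 1.5×IQR rule that boxplots use; a minimal sketch with hypothetical values:

```python
import pandas as pd

def iqr_outliers(series: pd.Series) -> pd.Series:
    """Return the values lying outside the 1.5*IQR whiskers (the boxplot rule)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical stand-in for one of the continuous columns listed above.
credit_limit = pd.Series([1438, 2300, 3500, 4200, 5100, 34516])
print(iqr_outliers(credit_limit))
```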

5.3 Missing-Value treatment

Treatment

5.4 Data Preparation for Modeling

5.4.1 Training, validation and test sets

We are going to encode Attrited Customers as 1 and Existing Customers as 0
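A sketch of the target encoding and the two-stage split (the toy frame and the 60/20/20 ratio are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the prepared dataset.
df = pd.DataFrame({
    "Customer_Age": [45, 49, 51, 40, 62, 38, 55, 47, 33, 58],
    "Attrition_Flag": ["Existing Customer", "Attrited Customer"] * 5,
})

# Encode the target: Attrited Customer -> 1, Existing Customer -> 0.
df["Attrition_Flag"] = df["Attrition_Flag"].map(
    {"Attrited Customer": 1, "Existing Customer": 0}
)

X = df.drop(columns="Attrition_Flag")
y = df["Attrition_Flag"]

# Two chained splits: first carve out the test set, then split the rest
# into training and validation sets.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(len(X_train), len(X_val), len(X_test))
```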

5.4.2 Imputing Missing Values

Inverse map the encoded values
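A minimal sketch of this encode, impute, inverse-map round trip, assuming a most-frequent imputation strategy (the actual imputer used in the notebook may differ):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with a missing categorical value.
train = pd.DataFrame(
    {"Education_Level": ["Graduate", None, "High School", "Graduate"]}
)

# Encode categories as integer codes; factorize marks missing values as -1.
codes, labels = pd.factorize(train["Education_Level"])
encoded = pd.Series(codes, dtype=float).replace(-1, float("nan"))

# Impute the missing codes with the most frequent value.
imputer = SimpleImputer(strategy="most_frequent")
filled = imputer.fit_transform(encoded.to_frame()).ravel().astype(int)

# Inverse map the imputed codes back to the original labels.
train["Education_Level"] = [labels[i] for i in filled]
print(train["Education_Level"].tolist())
```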

5.4.3 Creating Dummy Variables
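Dummy variables can be created with pandas' get_dummies; a minimal sketch with toy values:

```python
import pandas as pd

# Hypothetical categorical columns from the data dictionary.
df = pd.DataFrame({
    "Gender": ["M", "F", "M"],
    "Card_Category": ["Blue", "Silver", "Blue"],
})

# One-hot encode the categoricals; drop_first avoids the dummy-variable trap.
dummies = pd.get_dummies(df, columns=["Gender", "Card_Category"], drop_first=True)
print(list(dummies.columns))
```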

6 Models evaluation criteria

6.1 Insights

6.1.1 The model can make two kinds of wrong predictions:

  1. Predicting a customer closed the account when the customer actually keeps the credit card (a false positive)
  2. Predicting a customer keeps the credit card when the customer actually closed the account (a false negative)

6.1.2 Which case is more important?

6.1.3 How to reduce this loss?

6.2 Functions to evaluate models
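A sketch of such a helper, computing the headline classification metrics for an already-fitted model (the exact function used in the notebook may differ):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier

def model_performance(model, X, y):
    """Return the headline classification metrics for a fitted binary classifier."""
    pred = model.predict(X)
    return {
        "accuracy": accuracy_score(y, pred),
        "recall": recall_score(y, pred),
        "precision": precision_score(y, pred),
        "f1": f1_score(y, pred),
    }

# Tiny usage example with a perfectly separable toy problem.
X_toy, y_toy = [[0], [1], [0], [1]], [0, 1, 0, 1]
clf = DecisionTreeClassifier(random_state=1).fit(X_toy, y_toy)
print(model_performance(clf, X_toy, y_toy))
```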

7 Model Building

7.1 Logistic Regression model

Model performance by using KFold and cross_val_score

Performance on validation data
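The cross-validation step can be sketched as follows (toy data from make_classification; the fold count and scoring choice are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Toy data standing in for the prepared training set.
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=1000)

# scoring="recall" matches the project's focus on reducing false negatives.
scores = cross_val_score(model, X, y, cv=kfold, scoring="recall")
print(scores.mean())
```

The same KFold/cross_val_score pattern applies to each of the model classes evaluated below.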

7.2 Decision Tree model

Model performance by using KFold and cross_val_score

Performance on validation data

7.3 Bagging Classifier model

Model performance by using KFold and cross_val_score

Performance on validation data

7.4 Random Forest model

Model performance by using KFold and cross_val_score

Performance on validation data

7.5 Gradient Boosting classifier model

Model performance by using KFold and cross_val_score

Performance on validation data

7.6 XGBoost classifier model

Model performance by using KFold and cross_val_score

Performance on validation data

7.7 Model Performance Summary

Now, we are going to compare all models using the Recall score on the validation set

We are going to use Recall as the performance metric, with the goal of reducing False Negatives
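Recall = TP / (TP + FN), so maximizing it directly penalizes false negatives; a quick check with scikit-learn on toy labels:

```python
from sklearn.metrics import recall_score

# Toy labels: 3 positives, of which one is missed (a false negative).
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0]

# Recall = TP / (TP + FN) = 2 / (2 + 1)
print(recall_score(y_true, y_pred))
```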

8 Hyperparameter Tuning

8.1 Tuned Decision Tree

Performance on validation data

8.2 Tuned Gradient Boosting classifier

Performance on validation data

8.3 Tuned XGBoost classifier

Performance on validation data

8.4 Model Performance Summary

Now, we are going to compare all models using the Recall score on the validation set

9 Model building - Oversampled data

We are going to fit the 3 tuned models on oversampled data
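A minimal oversampling sketch using scikit-learn's resample to upsample the minority class (the notebook may instead use SMOTE from imbalanced-learn):

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced training frame: 8 existing vs 2 attrited customers.
train = pd.DataFrame({"x": range(10), "target": [0] * 8 + [1] * 2})

majority = train[train["target"] == 0]
minority = train[train["target"] == 1]

# Upsample the minority class (with replacement) to the majority class size.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=1)
oversampled = pd.concat([majority, minority_up])
print(oversampled["target"].value_counts().to_dict())
```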

9.1 Decision Tree on oversampled data

Performance on validation data

9.2 Gradient Boosting on oversampled data

Performance on validation data

9.3 XGBoost classifier on oversampled data

Performance on validation data

9.4 Model Performance Summary

Now, we are going to compare all models using the Recall score on the validation set

10 Model building - Undersampled data

We are going to fit the 3 tuned models on undersampled data
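Undersampling is the mirror image: the majority class is downsampled to the minority class size. A minimal sketch with scikit-learn's resample (the notebook may instead use RandomUnderSampler from imbalanced-learn):

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced training frame: 8 existing vs 2 attrited customers.
train = pd.DataFrame({"x": range(10), "target": [0] * 8 + [1] * 2})

majority = train[train["target"] == 0]
minority = train[train["target"] == 1]

# Downsample the majority class (without replacement) to the minority class size.
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=1)
undersampled = pd.concat([majority_down, minority])
print(undersampled["target"].value_counts().to_dict())
```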

10.1 Decision Tree on undersampled data

Performance on validation data

10.2 Gradient Boosting on undersampled data

Performance on validation data

10.3 XGBoost classifier on undersampled data

Performance on validation data

10.4 Model Performance Summary

Now, we are going to compare all models using the Recall score on the validation set

11 Model performances on test set

11.1 Model performance summary

Now, we are going to compare all models on the test set

11.2 Feature importance of XGBoost Undersampled data model
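Tree-based boosters expose a feature_importances_ attribute (XGBClassifier included); the sketch below uses scikit-learn's GradientBoostingClassifier as a stand-in so it runs without xgboost installed, with hypothetical feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy data with hypothetical feature names standing in for the real columns.
X, y = make_classification(n_samples=200, n_features=6, random_state=1)
cols = [f"f{i}" for i in range(6)]

# XGBClassifier exposes the same feature_importances_ attribute after fitting.
model = GradientBoostingClassifier(random_state=1).fit(X, y)
importance = pd.Series(model.feature_importances_, index=cols).sort_values(ascending=False)
print(importance.head(3))
```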

12 Pipelines for productionizing the model

Now that we have a final model, let's use pipelines to put it into production. The chosen model is the XGBoost classifier trained on undersampled data.

Therefore, the pipeline has 2 steps:

Now, create the independent and dependent variables

Finally, fit the pipeline model
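A minimal two-step pipeline sketch (StandardScaler and DecisionTreeClassifier are stand-ins here; the project's final step is the tuned XGBoost classifier fitted on undersampled data):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the independent and dependent variables.
X, y = make_classification(n_samples=100, n_features=5, random_state=1)

# Two steps: a preprocessing transformer followed by the final classifier.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", DecisionTreeClassifier(random_state=1)),
])

# Fitting the pipeline fits both steps in order; predict applies them end to end.
pipe.fit(X, y)
print(pipe.predict(X[:3]))
```

Persisting this fitted pipeline (e.g. with joblib) gives a single object that applies the same preprocessing at inference time.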

13 Actionable Insights & Recommendations

13.1 Insights of the model

13.2 Recommendations for the bank